Trajectory optimization of UAV-IRS assisted 6G THz network using deep reinforcement learning approach

Terahertz (THz) wireless communication is a promising technology that will enable ultra-high data rates, and very low latency for future wireless communications. Intelligent Reconfigurable Surfaces (IRS) aiding Unmanned Aerial Vehicle (UAV) are two essential technologies that play a pivotal role in balancing the demands of Sixth-Generation (6G) wireless networks. In practical scenarios, mission completion time and energy consumption serve as crucial benchmarks for assessing the efficiency of UAV-IRS enabled THz communication. Achieving swift mission completion requires UAV-IRS to fly at maximum speed above the ground users it serves. However, this results in higher energy consumption. To address the challenge, this paper studies UAV-IRS trajectory planning problems in THz networks. The problem is formulated as an optimization problem aiming to minimize UAVs-IRS mission completion time by optimizing the UAV-IRS trajectory, considering the energy consumption constraint for UAVs-IRS. The proposed optimization algorithm, with low complexity, is well-suited for applications in THz communication networks. This problem is a non-convex, optimization problem that is NP-hard and presents challenges for conventional optimization techniques. To overcome this, we proposed a Deep Q-Network (DQN) reinforcement learning algorithm to enhance performance. Simulation results show that our proposed algorithm achieves performance compared to benchmark schemes.

• UAVs-IRS trajectory design aided-THz wireless communication system, operating in the THz frequency band, where the direct path between BS-UEs is blocked is considered in this study.• A framework optimizing UAVs-IRS trajectory is introduced, aiming to minimize UAVs-IRS mission comple- tion time, taking into account energy consumption constraint of the UAVs-IRS.• A low complexity Deep Q-network (DQN) reinforcement learning based algorithm is proposed to finds an optimal low mission completion time and total energy consumption under various practical constraints such as flying time, hovering time, and the UAVs-IRS coverage area.• Simulation results demonstrate effectiveness of proposed algorithm compared to benchmark schemes in terms of completion time of trajectory, with many actual parameter settings.
This paper introduces an optimization strategy aimed to find an optimal trajectories of UAVs-IRS such that minimized total mission completion time of UAVs-IRS while satisfying the energy consumption constraint.The system model of the proposed algorithm is detailed in "System Model Design" section."Problem Formulation"

Network model
In this section, a downlink transmission UAVs-IRS aided THz wireless communication system is studied.We consider the downlink system model of a wireless network operating in the THz (0.3-10 THz) bands, to support outdoor applications, which consists of one BS operating in THz band that located at midpoint the urban area denoted by A .Also, number of Ground Users (GUs) are denoted by N which are uniformly distributed in the coverage area.While number for UAVs-IRS is denoted as M .The position of UAV-IRS j is dispatched from the midpoint area, and UAVs-IRS will hover for some time above each group of users at Hovering Points (HO), to cater the users within that group.We assume that G = G 1, . . .., G K where K is number of groups of users.It is assumed that direct links between BS and the GUs are obstructed by building or other obstacles environmental.The positions of the obstacles are assumed to be known, which is out of the scope of consideration in this paper.Communications between BS and GUs have been assisted by UAV-IRSs which are considered to be using Uniform Planar Array (UPA).The number of IRS reflective elements are I = I x I y where I x and I y are IRS reflecting elements numbers along the x-axis and y-axis respectively.Assume that rotary-wing UAV-IRS supports the flying hovering mode without taking acceleration and deceleration into account.Specifically, the UAV flies to the hovering points with a constant speed and hovers at these points with static status for servicing all users in groups.
Three-dimensional Cartesian coordinate system used to model UAV-IRS j in THz network, considered the coordinate for BS can defined as q B = x BS = 0, y BS = 0, h BS , where h BS is BS altitude and coordinates of user N is q n = (x GU n , y GU n , h GU n ) , UAVs-IRS j coordinates is q j = x UAVIRS j (t), y UAVIRS j (t), h UAVIRS j , respectively.The system architecture of UAV-IRS assisted THz wireless communication network is shown in Fig. 1.UAVs-IRS takes off from an initial point at the middle area G 1 , determines the location for hovering point at Group 1 G 2 that will be visited first.UAV-IRS repeats this procedure until servicing all users for all groups is completed, and flies back to G 1 .

Channel model
As the key to understanding the THz band and laying out fundamentals for communications network, THz multipath channel model characteristics are essential for system design and performance evaluation.Channel www.nature.com/scientificreports/models may include parameters such as sparse multipath loss, molecular absorption loss, diffuse scattering and specular reflection 30,31 .Consequently, channel transfer function is expressed as follows: where the term e −0.5k(f )d indicates the channel path-loss due to the molecular absorption and the concentration of water vapor molecules in the air 32 , and c is speed of light, f is the carrier frequency.The k f represents the absorption loss coefficient that is based on the transmission frequency and expressed as: where n, g is denoted the gases and all their isotopologue existing in air, respectively.The standard temperature is T stp , P stp denotes the standard pressure, T represents the temperature, and P is the pressure.The transmission environment is indicating as Q i,g , and σ n,g (f ) represents the number of molecules per unit volume at frequency f 28 .Euclidean distance from BS to the UAV-IRS and from UAV-IRS to GUs represents as where d 1 is distance between BS and first reflector of IRS, and d 2 is the distance between first reflector of IRS and GUs, can be expressed as follows:

Coherence time
Coherence time is a temporal measure of the channel's coherence.It is referring to measure of the time duration over which the wireless channel impulse response remains constant sufficiently for effective communication 33 .
In the THz frequency band, due to the extremely high frequency and dynamic environments, the coherence time is typically very short.Particularly, we would like to clarify our system model can be acquired with the longer coherence time, which increases the accuracy of the distance determination.To manage coherence time effectively in system model to ensure reliable communication, we utilize a fixed user in multiple UAV-IRS scenarios in THz network offers several advantages including stability, typically experiencing more stable channel conditions since they are not moving relative to the network infrastructure.

Transmission strategy
In the transmission strategy, UAV-IRS is used as the air relay to transmit information between BS and GUs, assuming that direct links are blocked due to blockage in the environment.The wireless signal is transmitted to the receiver through channel transfer function in THz band.
Additionally, the adaptability of UAVs-IRS allows them to relocate by sensing channel characteristics to improve actively channel conditions.Consequently, channel conditions are actively modified compared with fixed infrastructure communication.IRS is equipped with I x I y elements of phase shift to establish the reflection communication link among BS and GUs through regulating IRS phase shift.Thus, the channel gain of BS-UAV-IRS-GUs can be represented by 34 : where, is beamforming matrix of IRS which is the diagonal phase shift matrix, defined as: � = diag[w 1 e jφ 1 , w 2 e jφ 2 , . . ...w n e jφ n ] where w n ∈ [0, 1] is an amplitude reflection coefficient of the reflective element of IRS, φ ∈ [0, 2π] is the phase shift of IRS element, and j is a complex number's imaginary unit.
e t (t) indicates the transmit vector from BS to UAV-IRS and e r (t) is received vector from the UAV-IRS to GUs, respectively.Finally, the cascade channel of BS-UAV-IRS-GUs link as follows 35 : where, r t (t) is the transmission vector array from BS to the first component of UAV-IRS, and r o (t) is the transmis- sion vector array from first component of UAV-IRS to GUs.

UAV-IRS model
Without loss of generality, we assumed UAVs-IRS j fly with fixed altitude h , moving at a constant speed v in (m/s) and not exceeding maximum speed v max .At each time slot period δ t , it can move with the maximum distance as d max = v max δ t .In each time slot t , UAV-IRS will move with flying action determined by an angle of µ t ∈ [0,2π ], and a distance of d t ∈ [0, d max ].This work only addresses 2D trajectory optimization, as the environment is dominated by high-rise buildings that would be necessary long climbing phases to be overflown.Based on the movement of UAV-IRS, the position of UAV-IRS will change after each time slot t , and the moving distance will be calculated as follows: (1) where ϕ t is UAV-IRS direction at time slot t from UAV-IRS j to the users n .Also, we assumed that the trajectory of UAV-IRS j can visit all sequence of locations of stop points G j in Q j (T F j ) selected as: This paper assumes that the entire process of deployment UAVs-IRS is divide into T time slots of unequal length, and each time slot t with sufficient slot length δ t .For ease of exposition, we divide it into three compo- nents.So, the completion time of UAV-IRS trajectory is directly related to the summation of the flying time, the flying hovering, and operation time of UAV-IRS j .The overall completion time of UAV-IRS j is as follows: The first phase UAVs-IRS processing time T pro j Based on the group's current location and avoiding obstacles, UAVs-IRS calculates the next optimal location, which includes the planning duration time, which varies depending on the complexity of the mission, and the trajectory execution (trajectories are designed based on the specific requirements of a mission, taking into account factors such as optimal UAV-IRS location, hovering points changes).
The second phase UAVs-IRS flying time T F j Time required for UAV-IRS to fly from the present hovering point to the next new hovering point.In other words, we need to decide the visit order of the hovering point to minimize the flight distance of UAVs-IRS and thus minimize the flight time.Let G 1 and G K be the optimal starting and ending group positions of UAV-IRS respectively.Moreover, the order of visiting the hovering points is expressed as G 1 , . . ..G k and let the order set G to present UAV-IRS whole trajectory be: According to UAV-IRS speed v , it is possible to determine UAV-IRS flying Time T F j which can be represented as: where d G i − G j is the distance function that calculates the distance between hovering points G i and G j .
The third phase UAVs-IRS hovering time T hov j After UAVs-IRS selection of the hovering point of each group, and UAV-IRS visiting order to these hovering points.In this case, the hovering time of UAVs-IRS is the time it takes to associate with the users through communicating and collecting the data and serving users inside the groups.
Let T hov j be the hovering time that the users in the k-th cluster communicate with UAV-IRS, which can be expressed by: where D is total data transmitted for users, and R(t) is the achievable data rate for the downlink transmission from the UAVs-IRS j to users M at each time slot t is 36 : where ψ the network experiences interference, the channel gain is CH(t) , P j is transmit power from UAV-IRS to their GUs, B j is the total THz-bandwidth, and σ 2 is the additive white Gaussian noise (AWGN).
Then, k f the absorption loss coefficient, and the distance from BS-UAV-IRS-GUs represents d .The fol- lowing total mission completion time for UAVs-IRS j as follows: The UAV-IRS responds to environment and makes decisions swiftly, so the processing and decision time is negligible compared to the flight time and hovering time of UAV-IRS, the division of time slots is displayed in Fig. 2

Energy consumption model
There are many factors that are involved of UAV-IRS energy consumption such as the energy during operations such as flying energy to keep UAV-IRS mobile and hovering energy.Regarding for the power consumption of the UAVs-IRS, it mostly consists of the communication-related power consumption and propulsion power consumption.The former primarily consists mostly of signal, circuitry, reception, and processing consumption, which is negligible compared with the propulsion power consumption.Therefore, the propulsion power consumption measures the energy consumed to fly or hover over UAVs-IRS.In general, energy consumption depends not only on UAVs-IRS velocity but also on its acceleration and deceleration.It should be noted that ignored the energy consumption during the acceleration and deceleration of UAVs-IRS 37 , which makes sense in reasonably where the acceleration and deceleration speed or acceleration/deceleration duration is minimal 38 .Therefore, power consumption of rotary-wing UAV-IRS flying with speed v represents as: where P 0 and P 1 are blade profile power and induced hovering power state, respectively which are two con- stants linked to the physical parameters of the UAV-IRS.The tip speed of the rotor blade represents U , average rotor-induced velocity when hovering denotes v 0 , the fuselage drag ratio is d 0 , the rotor stiffness denotes as s, air density is ρ, and A represents the rotor disk area.The relevant parameters are explained in details in Table 1.By substituting v = 0 into (15), the hovering power consumption which is a finite value depending on the air density, aircraft weight, and rotor disc area, can be expressed as: As can be seen from ( 15) the propulsion power consumption of rotary-wing UAVs-IRS consists of three parts: induced, blade profile, and parasite power.The blade profile power and parasite power, increase cubically and quadratically, needed to control profile drag of blades and fuselage drag, respectively.However, induced power is needed to overcome the induced drag of the blades, that decreases with v.
(15)  www.nature.com/scientificreports/After that, the total energy consumption of UAV-IRS used to determine the general consumption of energy denoted by E during flight which is the sum of hovering energy E hov j and flying energy E F j , and we deploy more than one UAV-IRS, so the fixed cost (e.g., take off or land in, maintenance) may incur E pro j .Based on the above analyses, total energy consumption of proposed algorithm is expressed as: wherein energy consumption of UAV-IRS flying E F j given as follows: E F j = T F j * P F , where P F is the flight power of UAV-IRS during flight 14 , and T F j is the amount of time the drone spent flying overall.Moreover, E hov j the energy consumption of hovering time is given as follows: E hov j = T hov j * P hov , where P hov is the UAVs-IRS hovering power, and T hov j hovering time, respectively.

Obstacle model
In this paper, we assumed that the sizes and locations of the obstacles are known, since each blockage is represented as a rectangle with dimensions being the length, the width, and height of the obstacle, respectively.Obstacles can be measured by 3D point.In particular, we consider several obstacles, and their model are referred to in 39 .After that, GUs will be deployed randomly on the area size with these obstacles.

Area decomposition
In this subsection, large area is decomposed depending on the available number of UAV-IRS.Aim to prevent UAVs-IRS from under-collision and avoid obstacles in environment, and balance the energy consumption for different UAVs-IRS.In order to achieve better coverage and improve the system's performance, the area is divided into various groups.As mentioned previously the number of UAVs-IRS is not predefined in the system.Inspired by the work in 37 , a modified k-means algorithm was used for divide users into various groups that are classified based on shortest distance criteria that allocate users to groups and can apply data rate constraints in accordance with predetermined requirements.

Problem formulation
The objective function of this work is to find the optimal UAVs-IRS trajectory that minimizes the total mission completion time of UAVs-IRS while satisfying energy consumption constraints.Mission completion time includes the time taken by UAVs-IRS flying time, hovering time, and operation time.refers to the trajectory time taken by all UAVs-IRS to complete respective trajectory tasks.To maintain THz wireless communication system, our optimization problem can be formulated mathematically as a constrained optimization problem, where the objective function represents the total mission completion time of UAV-IRS and the constraints considering represent the energy consumption of UAVs-IRS that can improve the performance of the system.Regarding the optimizing trajectory problem, the UAVs-IRS seeks to find the optimal trajectory in minimum mission completion time is denoted as T j .Therefore, the optimization problem can be formulated as the following: where constraints (18a) indicate that the maximum distance travelled by the UAV-IRS j cannot exceed the maxi- mum speed V max .Constraints (18b) the total mission completion time of UAV-IRS T j cannot be greater than T max .Constraint (18c) indicates the transmission energy of the UAV-IRS has to be less than the maximum available energy.Constraint (18d) denotes that to keep a minimum secure distance of any two UAVs-IRS (to satisfy the collision avoidance.(18g) specifies that the movement direction of UAV-IRS between (0, 2π).While constructing UAVs-IRS trajectories Q j (T F j ) we have to consider the deployment optimization of UAVs-IRS stop points, and the number of UAV-IRS to find optimal trajectory time of UAV-IRS with minimum energy consumption.

Framework of the proposed algorıthm and policy
The proposed algorithms are described in detail in the following subsections.The problem formulation of the proposed UAV-IRS trajectory algorithm which is an NP-Hard problem.The objective is to cover the area by passing through all hovering point locations, and minimizing energy consumption.In order to handle this problem in a tractable manner, we proposed DQN based algorithm and an Iterative heuristic algorithm for UAV-IRS Trajectory planning.
This section, will explain this DQN algorithm and heuristic approach to solve the formulated problem.

Deep Q-network (DQN) algorithm
This subsection introduces the proposed DQN algorithm and optimization policy framework.According to the above system model, we design the trajectory UAV-IRS assisted THz communication system as an environment.
Finding the optimal solutions in the dynamic environment is an effective dynamic programming techniques to solve the sequential decision-making problems.According to DQN proposed UAV-IRS assumed as agent is employed for interacting with the environment and approximate an optimal action-value function for maximizing the accumulated rewards within a sequence of states 40 .One of the popular Reinforcement Learning methods that combines Q-learning and neural networks is called DQN.Furthermore, DQN method has ability to handle non-linear function approximation and high-dimensional state spaces in complex environments with various obstacles.In order to improve the training effect and convergence rate DQN-based algorithm, we proposed to apply the iterative updating process and the Experience Relay (ER) method to correlations with the target by matching the action-values to periodically updated target values.And it improves data distribution by removing the correlation between observation states and random batch selection of samples 41 .

DQN algorithm processing framework
In the DQN-based algorithm framework consists of state s t , the action a t , and the reward r t , action selection by an agent UAVs-IRS is accompanied by exploration and exploitation to obtain the highest reward even though updating current states.We initialize evaluation network, target network, and experience replay memory C from Lines 1 to 2. In this paper, during every episode, with the state variable s t is first initialized.Then the agent uses ε-greedy policy find balance between exploitation and exploration, which introduces parameter ε with a small value to generate and chooses the action a t .Specifically, the agent randomly selects from A with probability 1-ε or chooses at that has the maximum Q value and explores the environment with probability.UAVs-IRS energy consumption is determined using Eq. ( 17).The reward is then obtained using Eq.(21).In line 10 transition will be stored in experience replay memory.From Line 11 the learning process starts by selecting randomly sampling M transitions from memory to train evaluation network.Evaluation network parameter updated by Eq. (22).Finally, there are periodic updates to target network as well.Therefore, proposed DQN algorithm, which is described in details in Algorithm 1.The remaining corresponding elements, the state, action, and rewards, are clearly explained as follows: State space.The state of agent has: the location of UAV-IRS (stop point current coordinate), we describe state space of all agents' states for each episode are defined as Action space.In this model, according to the observed state s t (t) , the agent uses discrete actions space to learn from the environment that maximizes the Q value.The agent (UAV-IRS) choosing a proper flying direction and arrives the next location to obtain a better reward.For each taken action, we assume that the agent UAV-IRS moves with a constant speed and the same altitude.More specifically, we assume the possible actions for the agent UAV-IRS are containing eight directions {forward, backward, left, right, northeast, northwest, southeast, and southwest}.All UAVs-IRS take actions one after another in a sequential manner.A large set of actions might be taken into account in this work but for simplicity purposes and without loss of generality, we maintain the most necessary actions to choose best UAVs-IRS flying direction in order to get greater rewards based on the current state s t (t) the next state s t (t + 1) which is trained by DRL-DQN algorithms.Thus, the action space a t (t) is given by: To be specific, indicates UAV-IRS remains still.Where ϕ t (t) is the direction of the UAV-IRS at time slot t from UAV-IRS j to the users m , t(t) is the time that the UAV-IRS moving at time slot t , respectively.Reward function.The reward function determines the objective of the DRL problem, which is depending on the state s t and action a t .In addition, the objective function of P1 is to minimize the mission completion time of UAV-IRS trajectory, while the objective of DRL (DQN) is to take an appropriate action to maximize the reward (19)   www.nature.com/scientificreports/ in any state s t .Any action taken by the agent must satisfy the constraints specified in (18).If these restrictions are not adhered to, a penalty value is applied to the reward function.Thus, as the reward function, the agent will learn to maximize this value, which effectively minimizes the completion time is given by: Motivated by the work presented in 42 , we provide here DQN-based algorithm for UAVs-IRS trajectory optimization.In DQN, UAV-IRS is controlled by an agent to interact with environment.It is assumed that are two DNNs named evaluation network and target network.It should be noted that even if the evaluation network has same structure as target network, but extremely periodically updates.Initially, the agent receives the state from the environment and forwards it to the evaluation network, that generates Q-values Q(s t , a t ) relevant to each action.The action is generated using greedy policy based on the Q-values 43 .Subsequently, the environment provides reward.The action space contains 8 elements which are varying from (A0, A1, A2, A3, A4, A5, A6, A7).To be specific, A0 means UAV-IRS flies forward, A1 implies UAV-IRS flies backward, A2 stands for that UAV-IRA turns to the left, A3 signifies that UAV-IRS flies towards right, A4 means the UAV-IRS flies northwest, A5 represents the UAV-IRS flies northeast, A6 indicates the UAV-IRS files southwest, and A7 represents UAV-IRS files southeast.Then, the transition is stored into an experience replay memory and transition consists of (s t , a t , r t , s t+1 ) The learning process procedure starts when there are sufficient transitions in experience replay memory.In order to train the evaluation network, M transitions are randomly sampled in a mini-batch.The loss function for the evaluation network updating can be determined given the maximal Q-values max Q(s t `, a t `) from target network and the Q-values Q ( s t , a t ) from evaluation network.This can be expressed as: where δ i denotes parameter of DNN, and i represents index of iterations.Algorithm 1: Proposed DQN algorithm for multi UAV-IRSs trajectory (21)

Iterative heuristic algorithm for UAV-IRS trajectory planning algorithm
In this section, the proposed iterative heuristic approach for UAVs-IRS trajectories algorithm framework is introduced.In order to optimize all UAVs-IRS trajectories, we proposed UAV-IRS trajectory optimization for Flying into Group ( FIG ), Hovering point ( HO ), and Flying out Group ( FOG ) in area of GUs.
Coordinates of passing points of UAV-IRS in the serving users in the area are expressed as: Initially, from FIG j to HO j and HO j to FOG j , UAVs-IRS fights to the hovering point HO j to serving all users inside Group.During the serving all users, the trajectory UAV-IRS flying from point FIG j , passes through hover- ing point HO j , and hovers at it for a while to serving all users inside Group, then flying out point FOG j .
Based on the above analyses, we optimize UAV-IRS trajectory with the given UAV-IRS flying speed v .Therefore, the total mission completion hovering time of UAV-IRS on the trajectory can be expressed as: where t F in HO j indicates flying time of UAV-IRS from FIG j to HO j , t F ou HO j denotes flying time of UAV-IRS from HO j to FOG j , and t hov HO j indicates hovering time of UAV-IRS at HO j .
Algorithm 2: Heuristic Optimization Algorithm of UAV-IRS Trajectory

Time complexity analysis
In this sub-section, we investigate the complexity of proposed framework, which is composed of DQN algorithm and the heuristic approach.The time complexity analysis is described as: Time complexity for heuristic approach is determined by O(K × I), where K represents the number of UAV-IRS stop points and I denote the maximum iteration numbers of the heuristic.Time complexity of DQN algorithm: Algorithm 1 demonstrated after T time slots and M iteration rounds, DQN algorithm neural network parameters converge and tend to be stable.Number of operations of the network model is represented by the time complexity of a neural network and is determined by the number and layers for neurons in every layer of the network as well as the dimensions of the input state and action respectively.Number of layers of DQN neural network is expressed by L , and number of neurons in layer l represented by ( www.nature.com/scientificreports/u l .Operation times of DQN neural network in each time slot can be represented by L l=0 u l u l+1 .Thus, time complexity of DQN algorithm is O( M T L l=0 u l u l+1 ).

Simulation results and discussion
This section presents and verify the performance of proposed algorithm for outdoor environment scenario.Simulation was executed in Python programming language.Proposed DQN algorithm uses the same fully connected neural network structure which includes input layer for status information, an output layer that outputs the optimal action, two layers that are hidden in simulation using the function of activation, and modular normalization components.Additionally, to train policy network and Q-network using Adam Optimizers, and zero-mean normal distribution should be used to randomly initialize the values.It is important to note that simulation results shown were averaged over 500 independent iterations.Furthermore, we considering the urban scenario with area 50 m × 50 m, including one BS located in middle area which can serving 50 GUs.The locations of the GUs are uniformly distributed within the circle coverage area with radius r = 6 m, in each episode, the GUs locations assumed to be fixed.To avoid path loss peak, THz carrier frequency and THz transmission bandwidth operating in system model are set as f =0.3 THz and B = 10 GHz, respectively.Other parameters are shown in Table 1 26 , and the settings parameters of network model are given in Table 2 according to 42 .
To evaluate the proposed UAVs-IRS trajectory algorithm performance, we compare and implement the following three benchmark schemes in a clear manner.Then we demonstrate the performance metrics: the mission completion time of UAV-IRS, the energy consumption, the average trajectory time, and the average throughput, for different UAV-IRS trajectory designs.All algorithms are deployed in the same environment for comparison and results are shown in the figures below: Proposed algorithm: To evaluate how the proposed algorithm the optimal performance, we use DQN algorithm to get the optimal solution that can finds the optimal UAV-IRS trajectory with minimum mission completion time of UAV-IRS in THz network and minimize the energy consumption.
The Heuristic approach: In this scheme while searching for UAV-IRS trajectory, all feasible UAVs-IRS trajectories are listed and the optimal set of trajectories and usually solution is chosen the shortest trajectory to the target.
Random trajectory algorithm: This scheme is selected as an optimal UAV-IRS trajectory design, and each time UAV-IRS randomly selects the direction.
PPO algorithm: This algorithm is continuous action spaces, and typically find an iteratively trajectories using the current policy, estimating advantages for every state-action, and updating the policy parameters to maximize expected rewards while ensuring that the policy changes don't affect stability too much, to maximize expected rewards.
Figure 3, it is observed that UAV-IRS trajectory after training using the DQN algorithm, the model is trained on 10 groups of users in THz networks.Number of UAVs-IRS is three and UAV-IRS leaves from the initial location (25, 25).50 users are random uniformly distribution in the area of size is at 50 m × 50 m.Additionally, we can observe that the Y-shaped marks of different colors represent the users which are distributed into different groups, the yellow of each circle represents the position of the optimal location of the group.The blue rectangles represent obstacles in simulation, and the line segments represent the paths taken by UAVs-IRS, the diamondshaped represent the start of trajectory and the pentagrams the end of trajectory.
Finally, it is clear that no collision is detected with static obstacles is found throughout trajectories, and UAVs-IRS can weave effectively around obstacles.We utilize our improved DQN proposed algorithm to achieve the optimal trajectories for UAVs-IRS while minimizing the missions time of all UAV-IRS and minimum energy consumption.Hence, the final three trajectories of UAVs-IRS in this scenario model: the trajectory of UAV1-IRS represents red line {G0, G5, G4, G10}, the trajectory of UAV2-IRS represents blue line {G3, G6, G2, G1} and the trajectory of UAV3-IRS represent as green line {G7, G8, G9}.
Figures 4 and 5 respectively show average UAV-IRS trajectory time (seconds) and the average throughput with different trajectory designs.As the number of users changes, the trend of average trajectory time is shown in Fig. 4. To evaluate performance of the different algorithms clearly, we define the average trajectory time as the summation of both flying time and hovering time, that can be given as: Parameters settings for network model 42 .www.nature.com/scientificreports/ In this comparison, we utilize two benchmark algorithms as the comparison algorithms namely, heuristic algorithm based on the distance and PPO algorithm.It can be observed from Fig. 4 that proposed algorithm imposes the largest trajectory time as the increase of number of users.The PPO and heuristic algorithms consume much less UAV-IRS trajectory time compared with proposed algorithm.From practical aspects, the proposed algorithm is preferred with a relatively low complexity while achieving good performance.From practical aspects, the proposed algorithm is preferred with a relatively low complexity while achieving good performance.Nevertheless, the heuristic algorithm provides near-optimal performance and consumes less UAV-IRS trajectory time than the proposed algorithm, but the high computation complexity of the heuristic algorithm might limit its potential in practical scenarios.However, it can be observed that PPO algorithm may require more iterations to update its policy, especially in continuous domains that often learns more complex policies, to provides optimal converge solution.
Furthermore, we also compare the average throughput of proposed algorithm, heuristic algorithm, and PPO algorithm in Fig. 5.

The average throughput
In our scenario, the average throughput is the summation of the UAV-IRS payload data per trajectory/ trajectory time in the combined scenario divided by the total number of UAVs-IRS, which can be determined as follows: Figure 5 evaluates the average throughput (Gbps) of all algorithms as a function of the number of users.We also present the case random trajectory optimization, PPO algorithm and heuristic algorithm for the comparison.

The average throughput =
UAVs − IRS payload data per trajectory/ trajectory time total number of UAVs − IRS www.nature.com/scientificreports/Meanwhile, it has been observed in all schemes that increasing the number for users results in a higher overall average throughput.Specifically, the average throughput value of the proposed algorithm is very close to that of PPO algorithm.This is due to the fact that the PPO algorithm is continuous action spaces, that means it can handle the wide range actions and find all possible iteration (all available trajectories) to find an optimal solution without discretizing them.Furthermore, the heuristic algorithms assign the final path of UAV-IRS and always select the shortest path of UAV-IRS.This is normal and understandable in local path planning algorithms regardless of the increasing energy consumption than in our proposed algorithm designs.Another observation from Fig. 5 once number of GUs is 80, the proposed algorithm may achieve an average throughput of 24 Gbps.Therefore, when compared with another approaches, we can reach 26 Gbps as compared to heuristic algorithms, and by 25 Gbps at PPO algorithms.This is due to signal-to-noise ratio is significantly higher at THz frequencies because of the high path-loss characteristics of terahertz channels, leading to low interference among users.Although, performance of proposed algorithm is significantly inferior to that of heuristic algorithm, our computational complexity is O( M T L l=0 u l u l+1 ) , that is lower than heuristic approach.As a result, we achieved an appropriate trade-off between computational complexity and acceptable performance.
To investigate the effective mission completion time for UAV-IRS trajectories assisted THz network.It can be observed, that mission completion time defined as the trajectory time consumption of UAV-IRS and related to the flying time, hovering time, and processing time given in Eq. (28).Specifically, when UAV-IRS travelling from the first visited hovering point in a group to end optimal location in group, i.e., in completing its serving users task inside the area.
In Fig. 6, we further compare the effective mission completion time of UAV-IRS for all algorithms, random trajectory optimization, PPO algorithm, and heuristic algorithm, for different values of users, when employing IRS with 64 elements, and set other parameters the same as those set in Fig. 5.It can be observed from the figure that with the increasing number of users, our proposed scheme achieves better results, the advantages of our scheme become more obvious, we optimized the trajectory of UAVs-IRS to minimize mission completion time of the system.This is expected since when we utilizing the location information for the optimal hovering points more wisely is beneficial for UAVs-IRS trajectory design, since it becomes time not wasteful for UAV-IRS to visit all the hovering points and find the optimal trajectory.When the number of users is large, the trajectory mission completion time is higher after PPO algorithm.This is because the agent UAV-IRS in PPO algorithm has to investigate a wider diverse of states and needs to explore more to find the optimal policies for each.This exploration may slow down the convergence towards the best solution, leading to longer mission completion time.However, random trajectory optimized is the high mission completion time, which can cause UAVs-IRS to waste a lot of time collection in transmission area.In addition, we considered the nearly passive IRS phase shift which adopts a combination of some passive reflecting elements and active reflecting elements in sequence, to reflect more signal paths, and elements of IRS to increase SINR at the cost of a slight performance loss.
Additionally, we study the impact for different number of UAVs-IRS, on mission completion time, number of GUs was set 50 and we compared the other algorithms.Number of UAVs-IRS changes, mission completion time of UAVs-IRS is shown in Fig. 7. Obviously, increasing the number of UAVs could enhance the speed of monitoring area and serving of all users, and mission could be completed in less time.It can be seen from Fig. 7, when number of UAVs increases, mission completion time significantly reduces, compared with other algorithms our algorithm has achieved better results.This is because we consideration the energy constraint for each UAV-IRS in trajectory planning, and considers mission requirements to deploy an appropriate number of UAVs to complete the trajectory mission.However, increasing number of UAVs-IRS in the system may brought overhead, and reduce the energy efficiency.Figure 9 illustrates mission completion time as the area sizes changes.Under different area size, we set number UAVs-IRS = 3 and the number for GUs at 55 and compared between algorithms.It was observed, that for larger network size area mission completion time was gradually increasing.
That was expected, since the larger area size caused a long flying time, UAVs-IRS could obtain more time to serve all users in the a large area size and the energy consumption is increased.We can observe that, when the mission area size is large, mission completion time of proposed algorithm is also lower than other algorithms.This is because the proposed solution in this paper allows UAVs-IRS to take full advantage of its advantages by decomposed the area is into various groups and choosing the optimal hovering points.Thus, UAV-IRS can fly less distance and serving all users more quickly, that allows UAV-IRS to have more energy to fly faster, to speed up the mission completion time at cost for consuming more energy of UAVs-IRS.We can observe, the highest mission completion time was achieved by the heuristic approach, where consumes a lot of energy.This is due to the high computation complexity of the approach might not adapt well to use in realistic scenarios.
Figure 10, we test the proposed algorithm performance by evaluating number for UAVs-IRS versus the number of GUs.All results are derived by the same network configuration and results of more than 100 runs were averaged.Furthermore, number for UAVs-IRS required by all algorithms increases gradually with increasing number of GUs.This is due to the fact that more UAVs-IRS will be needed to provide wireless coverage for  increasing the number of users (dispersed in multiple groups in the area).Proposed algorithm obtains better performance, we deployed UAVs-IRS in the optimal locations in the group to hover for some time where the user's density is high.The selections are the best hovering locations in terms of the average distance of UAV-IRS to GUs to provide better service all users.Thus, we expect that proposed algorithm can reduces the required number of UAVs-IRS to serve GUs.This figure illustrates the significant savings in number of UAV-IRS with reduction of up to three UAVs-IRS for the scenario under evaluation when compared to alternative approaches.

Energy consumption
Energy consumption is recognized as a crucial metric for performance evaluating for algorithms across various domains and applications.It illustrates how ability proposed approach can minimize energy consumption and ensure safety by avoiding collisions with obstacles.
In order to analyze the proposed algorithm that contributes to maximizing energy saving, energy consumption for proposed algorithm and benchmark schemes for various numbers of users is shown in Fig. 11.It can be easily inferred that energy consumption of the three methods goes up simultaneously with an increase in number of GUs, we find that proposed algorithm is always lower energy consumption than PPO algorithm.It can be observed that the heuristic approach illustrates the highest energy consumption, the reason is that heuristic approach obtains the solution by looking to finds an optimal trajectory, but it needs longer time to generate the solution and consume more energy.Therefore, the proposed algorithm will be achieved the best solution for finding the optimal trajectory for UAV-IRS with maximizes the energy saving of UAV-IRS.Thus, in proposed algorithm we using flying at maximum speed v max which can save more time to power more GUs, and can achieve the optimization goal of UAV-IRS trajectory to minimize total energy consumption.Figure 12 shows the accumulated rewards of proposed algorithm, DQN, Heuristic approach and PPO algorithm.Initially, the proposed algorithm converges around 3700.PPO algorithm converges about 10,000 episodes, and the final reward value is 3600.Heuristic algorithm converges to 3100 around 12,500 episodes.Nonetheless, it can be seen that proposed algorithm achieves the largest average reward value and the fastest convergence speed, indicating the action selection strategy based on ε-pseudo count can more efficiently realize agent selection and utilization actions.Furthermore, performance gaps between proposed algorithm and benchmark algorithms are significant and keep increasing with iterations, that verifies proposed algorithm effectiveness.
Figure 13 shows the details of our proposed work by adjusting UV-IRS trajectory with fixed altitude and optimized altitude in terms of iterations.All results are derived by the same network configuration, we set number for GUs = 70, the altitude of UAV-IRS is constrained within the range from (50 m to 90 m).It can be seen from Fig. 13, when we used a constant altitude h i =50 m which is very close to the obtained value of optimal altitude is h i = 58 m in 500 iterations.However, we find out that UAV-RS is adjusted to relatively low altitude.This is because the amount of the propulsion energy is not sufficient to attract the UAV-IRS to higher altitudes.
In Fig. 14, we further compare the effective mission completion time of UAV-IRS for all algorithms, random trajectory optimization, PPO algorithm, and heuristic algorithm, for different values of users, when utilized the fixed altitude and optimal altitude.The optimal altitude is automatically calculated by DQN proposed algorithm.After that, it is applied to UAV-IRS, and UAV-IRS is positioned at the calculated optimal altitude.According to the time of calculating the optimal height, UAV-IRS instantly positioned in the designated group at optimal altitude h i = 58 m.Finally, while planning and optimizing 3D trajectories into UAVs-IRS operations in THz network offers many benefits, like in terms of flexibility and can improved coverage.However, 3D trajectory of UAV-IRS in THz network presents additional issues compared to 2D scenario, due to characteristics propagation of THz band which is very sensitive to blockage and attenuation by obstacles, and can affect to maintaining reliable communication links with UAVs-IRS at varying altitudes, which requires more energy propulsion, and add more complexity.Thus, these factors need to be carefully considered and addressed in a more detailed manner in the future work, to ensure the optimal 3D trajectory of UAVs-IRS varying altitudes with low energy consumption in THz networks.

Conclusion
This paper proposed framework for designing an optimal trajectory for UAVs-IRS algorithm assisted -THz wireless network.Accordingly, we proposed algorithm by optimizing trajectory of UAV-IRS, while ensuring UAV-IRS can successfully upload data rate from users inside groups with the remaining energy to minimize

Methods
To evaluate performance of proposed algorithm, extensive simulations are conducted.Simulation was performed in Python 3.7 and Tensorflow 1.15.0.In order to implement proposed DQN algorithm, we used AdamOptimizer that can update evaluation network with the rate 0.001.Also, we deploy two fully-connected hidden layers [256, 256] neurons and with 300 iterations target network updated.Mini-batch size, and experience replay memory are 10,0000 and 128 respectively.Every episode of the model training phase starts with initialized the task environment is randomly, that includes the locations of entities like UAVs-IRS, users, and obstacles that need to be avoided.Every episode is terminated when successfully finds the target with the highest reward, UAV-IRS runs out of battery UAV-IRS hits an obstacles.

Figure 1 .
Figure1.The illustration of system model.

Figure 2 .
Figure 2. Diagram of time slot division.
Flying time and hovering time of UAV-IRS affected by the FIG , FOG and HO .Algorithm 2 describes overall architecture of the proposed iterative heuristic algorithm to minimize mission completion time of UAV-IRS.The flying distance of UAV-IRS between FOG j−1 and FIG j defined as dm 1 .The shortest distance and longest distance between FOG j−1 and FIG j represented as d 1min and d 1max respectively.Also, dm 2 denote flying distance of UAV-IRS between FOG j and FIG j+1 .Similarly, the shortest distance and longest distance between FOG j and FIG j+1 are d 2min and d 2max , respectively.UAV-IRS needs to decelerate from the FIG to HO and accelerate from HO to FOG , respectively.

Figure 3 .Figure 4 .
Figure 3. UAVs-IRS trajectory obtained by DQN.The illustration of system model.Each Group of users appears in Y-shape, and optimal location of UAV-IRS is yellow circle, respectively.

Figure 5 .Figure 8
Figure 5.The average throughput versus number of users.

Figure 6 .
Figure 6.The mission completion time versus number of users.

Figure 7 .
Figure 7.The mission completion time versus number of UAVs.

Figure 8 .
Figure 8.The mission completion time versus speed of UAVs-IRS.

Table 3 .
Energy-saving ratios for the approaches when number of GUs increases from 40 to 90.